Genome assembly of Hibiscus sabdariffa L. provides insights into metabolisms of medicinal natural products

Abstract Hibiscus sabdariffa L. is a widely cultivated herbaceous plant with diverse applications in food, tea, fiber, and medicine. In this study, we present a high-quality genome assembly of H. sabdariffa using more than 33 Gb of high-fidelity (HiFi) long-read sequencing data, corresponding to ∼20× depth of the genome. We obtained 3 genome assemblies of H. sabdariffa: 1 primary and 2 partially haplotype-resolved genome assemblies. These genome assemblies exhibit N50 contig lengths of 26.25, 11.96, and 14.50 Mb, with genome coverage of 141.3, 86.0, and 88.6%, respectively. We also utilized 26 Gb of total RNA sequencing data to predict 154k, 79k, and 87k genes in the respective assemblies. The completeness of the primary genome assembly and its predicted genes was confirmed by the benchmarking universal single-copy ortholog analysis with a completeness rate of 99.3%. Based on our high-quality genomic resources, we constructed genetic networks for phenylpropanoid and flavonoid metabolism and identified candidate biosynthetic genes, which are responsible for producing key intermediates of roselle-specific medicinal natural products. Our comprehensive genomic and functional analysis opens avenues for further exploration and application of valuable natural products in H. sabdariffa.


Introduction
Hibiscus, a popular genus in the Malvaceae family, contains more than 200 species distributed in tropical and subtropical regions.Plants in genus Hibiscus are widely known to produce valuable natural products such as fibers, pigments, and medicinal secondary metabolites.As a medicinal plant, the phytochemical and pharmacological properties of Hibiscus sabdariffa have been well studied (Da-Costa-Rocha et al. 2014).The calyx of H. sabdariffa contains various bioactive organic acids, including phenolic acid, hydroxycitric acid, and hibiscus acid.In addition, H. sabdariffa contains anthocyanins and flavonoids such as delphinidin, cyanidin, hibiscetin, gossypetin, and quercetin (Da-Costa-Rocha et al. 2014).These bioactive compounds have been the subject of considerable research interest due to their potential health benefits and therapeutic properties.In H. sabdariffa extract, quercetin and its derivative, quercetin-3-O-glucuronide, have been found to reduce the levels of reactive oxygen species (ROS) induced by glucolipotoxicity.They also reduce triglyceride accumulation, which is critical for antiobesity (Herranz-López et al. 2020).In addition, the roselle extracts showed antihypertensive, antiinflammatory, and antimicrobial effects (Herrera-Arellano et al. 2004;Mardiah et al. 2015;Abdallah 2016).
Revealing the biosynthetic pathways of key natural products responsible for these medicinal effects would facilitate the pharmacological application of H. sabdariffa.Recently, transcriptomics-based approaches have been developed as effective tools for revealing the biosynthesis of complex natural products; biosynthesis of colchicine, benzylacetone, catharanthine, and olive sesquiterpenes were recently identified based on RNA-seq coexpression analysis (Nett et al. 2020;Kang et al. 2022;Stander et al. 2022;Sun et al. 2022b;Alagna et al. 2023;Li et al. 2023).Reliable transcriptome analysis and manipulation of gene expression require highquality genomic resources from the host organism (Hibrand Saint-Oyant et al. 2018;Raymond et al. 2018;Sun et al. 2022b;Li et al. 2023).Moreover, comparative genomics studies also need high-quality genomic resources, which provide critical evidence for the biogenesis of novel natural products.These evidences offer insights into evolutionary aspects and functional diversification (Zhang et al. 2020(Zhang et al. , 2023;;Tang et al. 2022;Sun et al. 2022a;Conart et al. 2023;Wang et al. 2023).
High-quality genomes of Hibiscus mutabilis, Hibiscus syriacus, and Hibiscus trionum have been published (Li et al. 2015;Kim et al. 2017;Yang et al. 2022;Koshimizu et al. 2023), and these genomes have built the fundamental basis for the genomic approaches.However, the available genome for H. sabdariffa is currently limited to a poor-quality draft (JAPQLB000000000.1,GenBank), which poses a challenge for comprehensive genomic and biochemical studies in roselle (Kim et al. 2017;Yang et al. 2022;Koshimizu et al. 2023).Recently, significant progress has been made in longread sequencing technologies, particularly with the introduction of high-fidelity (HiFi) sequencing by Pacific Biosciences.This has

DNA/RNA sequencing
The genomic DNA of H. sabdariffa plant was extracted from mature leaves and sequenced by NICEM (Seoul National University, South Korea) using the following protocol.The extraction method employed was based on the cetyltrimethylammonium bromide (CTAB) method, known for its high-molecular-weight DNA extraction capabilities (Porebski et al. 1997).The extraction procedure involved cell lysis of leaf tissues frozen on liquid nitrogen, DNA purification by the use of polyvinyl pyrrolidone, DNA concentration by vacuum-dry, and sequential elimination of short DNA to ensure the quality and integrity of the extracted genomic DNA.Extracted genomic DNA was qualified with NanoDrop, Quantus, and agarose gel electrophoresis to validate whether its quality is adequate to be sequenced.The PacBio HiFi sequencing libraries were prepared with SMRTbell prep kit 3.0 (Pacific Biosciences, Menlo Park, USA) and were sequenced on the SMRTbell barcoded adapter plate (Pacific Biosciences) on the PacBio Sequel IIe platform (Pacific Bioscience).
Total RNA extraction from mature leaves and RNA sequencing were conducted by Macrogen (Seoul, South Korea).Total RNA was extracted with the RNeasy Plant Mini Kit (Qiagen, Hilden, Germany), and RNase-free DNase I (Qiagen) was used to eliminate its genomic DNA contamination.Quality of RNA was measured with the RNA integrity number score on the Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, USA).The RNA sequencing libraries were generated with the TruSeq Stranded Total RNA with the Ribo-Zero Plant kit (Illumina, San Diego, USA) and were sequenced on the NovaSeq 6000 system (Illumina).

De novo genome assembly
Genome assembly and gene annotation were performed following a protocol paper (Kim and Kim 2022).In general, we used the programming languages R (v4.3.2) and Python (v.3.9.10) and the virtual environment Anaconda3 (v22.9.0) for the following programs.Our HiFi reads were de novo assembled into contigs using the Hifiasm assembler (v0.18.5-r500, cloned from official GitHub repository on 2023 February 14) on defaults options except -s 0.54 (-s represents the similarity threshold).It produced 3 main genome assemblies: a primary genome assembly that contains a complete assembly and 2 partially phased, haplotype-resolved genome assemblies, that is, haplotigs (Hap 1 and Hap 2 in this paper).The resulting graphical fragment assembly (GFA)-formatted genome assembly files were converted into FASTA-formatted files.The primary assembly and the 2 haplotigs were treated individually for all the following analysis.

Repeat and gene annotation
Since there is no available specific repeat library for H. sabdariffa, a de novo repeat library of H. sabdariffa was constructed using RepeatModeler (v2.0.3) with NINJA (NINJA-0.95-cluster_only;Flynn et al. 2020).This library and the known repetitive sequences in a predefined eudicot repeat library were used for RepeatMasker (v4.1.2) to mask repetitive regions of our genome assemblies with default options (Saha et al. 2008).As RepeatMasker predicts repeat regions based on classified repeat libraries and leverages repeat classification models internally, the results of repeat masking by RepeatMasker contained transposable element classification.Hard-masked outputs were utilized for analysis of repetitive regions, and soft-masked outputs were further utilized for gene annotation.
Based on repeat masking results, we found the telomeric repeats in the genome applying "grep" command to find the plant canonical telomeric repeat sequence "TTTAGGG" and its variant forms (Richards and Ausubel 1988;Adamusová et al. 2020).The candidate telomeric regions were inferred by the position and length, i.e. the telomeric repeats placing in the end of contig and longer than 1,000 bp (i.e.∼140 concatemers of telomeric repeat) were considered as candidate telomeres.When the position of telomeric repeats is 5′ end, the sequence should be "CCCTAAA" and its variant forms, and "TTTAGGG" and its variant forms were required for 3′ end.

Evaluation
Qualities of assembled genome and annotated genes were evaluated based on length statistics and benchmarking universal single-copy ortholog (BUSCO) results (Simão et al. 2015).BUSCO (v.5.2.2) analysis was performed on the eudicots_odb10 database to confirm whether assembled genomes and predicted gene sets appropriately contain the evolutionary conserved genes (Simão et al. 2015).The results of BUSCO analysis were plotted by gener-ate_plot.pyscript contained in a BUSCO package.We employed the assembly-stats (v1.0.1) to analyze the length statistics of contigs and genes, including the number, mean length, maximum length, N50 value, and total size (Sjunnebo 2007).The consensus quality (QV) of each genome was assessed by using meryl (v1.3) and Merqury (v1.3;Rhie et al. 2020).LTR Assembly Index (LAI) could be used to assess the continuity of genome assembly in a reference-free manner.The LAI of each genome was evaluated by using LTR_retriever (v2.9) with default options described in an official open-source GitHub repository of it: https://github.com/oushujun/LTR_retriever (Ou et al. 2018;Ou and Jiang 2018).The gene annotation quality was further assessed with the RNA read alignment rate onto genic region by using BEDTools (v2.27.1).The BAM files generated by aligning the RNA-seq data onto each genome in the Repeat and gene annotation section were used for this analysis.

Genetic metabolic network construction
The coding sequences (CDS) of genes predicted in primary and haplotig genomes were functionally annotated by similaritybased sequence search on UniProt Knowledgebase (release 2023.01) using MMseqs2 (v13.4511;Steinegger and Söding 2017; The UniProt Consortium 2017).The major metabolic profile of H. sabdariffa was defined based on PubChem taxonomy data (Supplementary Data 1), the referenced compound-organism pair database, LOTUS (Kim et al. 2021).To construct genetic metabolic network of the major metabolic profile of roselle, we referred the corresponding molecular pathways of the KEGG database (Kanehisa and Goto 2000).Subsequently, we extracted annotations of biosynthetic enzymes involved in the production of key intermediate compounds from pathways in other organisms using the KEGG database (Kanehisa and Goto 2000).To identify potential candidate genes for biosynthetic pathways in H. sabdariffa, we manually compared these annotations of interests to the functional gene annotations of H. sabdariffa genes, constructing a gene set potentially involved in the phenylpropanoid and flavonoid pathways.The expression level of each candidate gene was evaluated by aligning the RNA-seq data to Hap 2 genome with kallisto (0.50.1), which was used for gene annotation (Bray et al. 2016).We have investigated candidate biosynthetic gene clusters in Hap 2 genome by using plantiSMASH (Kautsar et al. 2017).This approach allowed us to narrow down and prioritize the candidate genes for further investigation and characterization in the context of biosynthetic pathways in H. sabdariffa.

Genome assembly and annotation
PacBio HiFi sequencing of H. sabdariffa genomic DNA resulted in 2.3 million reads (total 33.9 Gb); those mean length was 15.0 kb (Table 1 and Fig. 1a).The HiFi sequencing read quality of H. sabdariffa genomic DNA was verified by a Phred quality score (Q score) of Q34.This high-quality score assures the reliability and confidence of the sequencing data, allowing accurate downstream analyses and interpretations of the genomic information of H. sabdariffa.The karyotype of H. sabdariffa is 2n = 4x = 72, and the haploid genome size is 1.67 Gb, i.e. we obtained 20.3× coverage raw reads (Mohammad et al. 2020).The HiFi reads were de novo assembled with Hifiasm and phased into pseudo-haplotigs, Hap 1 and Hap 2 (Cheng et al. 2021).A haplotig represents a set of haplotype-resolved contigs.The primary assembly represented a total length of 2.36 Gb with a contig N50 length of 26.25 Mb and the longest contig measuring 70.47 Mb (Table 1).For the 2 haplotigs, the total length, N50 length, and maximum length of contigs were found to be 1.42 Gb,11.96 Mb,and 60.86 Mb for Hap 1 and 1.48 Gb, 14.50 Mb, and 48.40 Mb for Hap 2, respectively (Table 1).Compare to the previous draft genome of H. sabdariffa, our genome assemblies represented clearly improved metric (Table 1).The coverage of the primary genome, Hap 1 genome, and Hap 2 genome were 141.3, 86.0, and 88.6% of the 1.67-Gb haploid genome.In this study, Hap 2 was used as a reference haplotype of H. sabdariffa genome for SV analysis.
We evaluated each genome with respect to the length distribution of contigs and BUSCO score (Simão et al. 2015).The cumulative contig length graphs of the assembled genomes exhibited a left-skewed shape (Fig. 1b), indicating a higher proportion of longer contig lengths.The contig N50 length of 26.25 Mb for primary genome (consisted of the 28th contig) was comparable to 46.39 Mb, the arithmetic average of chromosome lengths (a total genome size/ the number of chromosomes) of H. sabdariffa.The completeness of genome was further evaluated by BUSCO analysis on the eudi-cots_odb10 database, which represented 99.3% completeness rate for primary genome, 90.1% for Hap 1, and 92.6% for Hap 2 (Fig. 1c), which are comparable with the chromosome-level genomes of other species in Hibiscus genus (Simão et al. 2015;Yang et al. 2022).
The RNA-seq alignment metrics, QV, and LAI were used to assess the quality of genome assemblies.A primary assembly, Hap 1, and Hap 2 represented 97, 87, and 88% of RNA alignment rate, respectively, which were clearly higher than the RNA alignment rate on previous draft genome, 70% (Table 1).QV and LAI were 64.5 and 7.24 for the primary assembly, where 61.5 and 6.73 for Hap 1 and 61.6 and 7.74 for Hap 2, which all demonstrated the high quality of genome assemblies.QV and LAI of the previous draft genome of H. sabdariffa were not estimated, so they were represented as NAs in Table 1.
The tetraploid nature of H. sabdariffa poses a challenge in de novo genome assembly, occasionally leading to misassemblies.Hifiasm is renowned for its ability to preserve the contiguity of haplotypes during genome assembly as a leading haplotype-resolved assembler (Cheng et al. 2021).In this study, we further deciphered both primary assembly and haplotype-resolved assembly.The primary assembly could represent each homologous sequence once and each heterologous "bubble" twice.The primary genome size of H. sabdariffa was 2.36 Gb, which is larger than the haplotype genome (1.67 Gb) but smaller than the diploid genome (3.34 Gb) of H. sabdariffa.However, the lower BUSCO score observed in the haplotigs compared to the primary assembly might be due to the inherent challenges in resolving a complexity in tetraploid genomes.On the other hand, the genome sizes of Hap 1 and Hap 2 were smaller than the theoretical haploid genome size (1.67 Gb vs 1.42 and 1.48 Gb, respectively).This difference can be explained by the fact that each haplotype is a reduced version of the primary assembly of the 2x genome (n = 2x = 36).To enhance the phasing and contiguity of the H. sabdariffa genome, incorporating parental sequencing information and Hi-C data would improve the resolution of haplotypes, leading to a more comprehensive and accurate representation of the genome.It is worthy to mention that despite this limitation, Hifiasm is a viable solution due to its graph-binning strategy and its capacity for constructing haplotype-resolved assemblies (Cheng et al. 2021;Sun et al. 2022a).
We next performed repeat masking with the eudicot repeat library of RepeatMasker.Lower than 4% of genome was annotated as simple and low complexity repeats.On the other hand, after being masked with de novo modeled repeat library, a significant portion of genome was annotated as repeats: 74.48% in the primary genome, 76.67% in Hap 1, and 73.99% in Hap 2 (Fig. 2a and Tables 2 and 3).The Hap 2 genome consists of 42.91% retroelements, 1.10% DNA transposons, and 24.43% unclassified interspersed repeats, where Hap 1 consists of 46.56% retroelements, 0.97% DNA transposons, and 23.05% unclassified interspersed repeats (Table 3).Specifically, more than 40% regions (41.69% in primary assembly, 44.59% in Hap 1, and 41.42% in Hap 2) of genomes were LTR elements, containing Ty1/Copia, Gypsy/DIRS1, and retroviral sequences (Tables 2 and 3).
The telomeric repeats of H. sabdariffa were analyzed using repeat masking results.The telomeric repeat sequence of H. sabdariffa genome was "TTTAGGG," which is known as the canonical repeat sequence of plants (Richards and Ausubel 1988).H. sabdariffa has 72 telomeric regions.Among the 2,602 telomeric repeat regions in primary assembly, 67 ends of contigs were inferred to putative telomere that were positioned at the end of each contig and longer than 1,000 bp (∼140 concatemers; Fig. 2b).In Hap 1 and Hap 2, 47 and 42 ends of contigs were annotated as putative telomeres (Fig. 2b).The longest putative telomeres were 7,916 bp in primary assembly, 6,973 bp in Hap 1, and 7,916 bp in Hap 2 (Fig. 2b).The occurrence of WGD events in H. sabdariffa was estimated through analysis of a Ks plot (refer to Fig. 2c).A clustering method revealed the possibility of 2 or 3 instances of WGD in H. sabdariffa.The presence of an elevated number of genes in the primary genome assembly of H. sabdariffa suggests the potential occurrence of relatively recent WGD events.We produced about 26 Gb of total RNA sequencing reads, of which 94.8% were higher than Q30 quality (Table 1).Each soft-masked genome was gene-annotated hinted by aligning those total RNA sequencing data onto each genome.A total of 154,226 genes in primary assembly, 79,143 genes in Hap 1, and 87,414 genes in Hap 2 were annotated (Table 1).The mean and N90 length of genes were 1,033 and 480 bp for primary assembly, 1,074 and 510 bp for Hap 1, and 1,099 and 525 bp for Hap 2, respectively (Table 4).They are comparable to the number of proteincoding genes, which are 87,603, and have an average CDS length of 1,188 base pairs in H. syriacus, implying the validity of the annotated genes.Each annotated gene set was translated into protein sequence and validated by protein BUSCO with the same database to genome BUSCO (eudicots_odb10 database).The completeness rates were 99.3% for primary assemble genome, 88.9% for Hap 1, and 92.0% for Hap 2 (Table 4).The RNA-seq read alignment rate onto the genic region of each genome was calculated by using BEDTools.A total of 94.8, 95.5, and 95.0% of RNA reads are aligned onto the genic region of the primary, Hap 1, and Hap 2 genomes, respectively.

SNP and SV calling
SNPs in primary, Hap 1, and Hap 2 genomes were analyzed for measuring genomic variation.The impact of SNP on transcripts was classified into 3 categories: high, moderate, and low following the criteria of SnpEff (Cingolani et al. 2012).In the primary genome, we found 496 high-impact, 10,196 moderate-impact, and 6,147 low-impact SNPs and ∼6,000 high-impact, ∼300,000 moderate-impact, and ∼400,000 low-impact SNPs in the Hap 1 and Hap 2 genomes (Fig. 3b).Subsequently, the SVs between Hap 1 and Hap 2 genomes were called and their intersections with CDSs were analyzed to find out gene sets that were affected by these SVs.Hap 2 genome contained 341 deletions, 194 insertions, and 75 presence-absence variants on corresponding number of genes (Fig. 3a).Given the redundancy observed in tetraploid organisms, these genes might have the potential for neofunctionalization.

Analysis of putative biosynthetic genes
All genes were functionally annotated based on similarity with the universal protein knowledgebase (The UniProt Consortium 2017).The taxonomic descriptions of proteins were further utilized to select candidate genes that are responsible for biosynthesis of roselle-enriched natural products: phenylpropanoids and flavonoids (Da-Costa-Rocha et al. 2014;Kim et al. 2021).The biosynthetic pathways of key intermediates involved in phenylpropanoids and flavonoids have been well characterized (Dewick 2002;Kanehisa and Goto 2000).Cinnamoyl-CoA and p-coumaroyl-CoA are the core frameworks of phenylpropanoids, and kaempferol and dihydrotricetin are known as the core intermediates of flavonoids.The candidate genes responsible for producing those intermediates were conserved in the H. sabdariffa genome (Fig. 4; Supplementary Data 2).The expression level of those candidate genes was represented by TPM in Fig. 4 and Supplementary Data 4. To understand the biosynthetic pathway of roselle-enriched natural products, the first step would involve experimentally validating the catalytic activity of candidate genes.Furthermore, those genes could be used as baits in coexpression analysis to identify uncharacterized biosynthetic genes for roselle-enriched natural products such as hibiscetin (Fig. 4  Glycosylation stands out as a crucial modification in the diversification of natural products.Medicinal natural products found in H. sabdariffa, such as quercetin-O-glucuronide and delphinidin-3-sambubioside, are known for their functions and are synthesized through glycosylation of intermediate flavonoids (Juliani et al. 2009;Herranz-López et al. 2020).However, the process of glycosylation remains elusive in both feasible chemical synthesis and known biosynthesis pathways.In our investigation, we identified 91 genes associated with flavonoid-O-glycosylation and assessed their expression levels in leaves (Supplementary Data 5).Only 6 genes represented TPM higher than 5.These genes present promising candidates in the search for novel and valuable glycosylation enzymes, potentially contributing to the production of medicinal natural products.The candidate biosynthetic gene clusters that include flavonoid glycosylation have been investigated in Hap 2 genome (Supplementary Data 6).
L-Tyrosine (Tyr) serve as a primary metabolite and a fundamental precursor for the phenylpropanoid and flavonoid pathways (Fig. 4).The biosynthesis of Tyr involves 2 distinct pathways: the arogenate dehydrogenase (referred to as ADH, TyrA a ) and the prephenate dehydrogenase (referred to as PDH, TyrA p ) pathways (Schenck et al. 2015).These early primary metabolism pathways, comprising the key regulatory enzymes ADH and PDH, are critical in modulating carbon flux toward Tyr biosynthesis (Schenck et al. 2015).This intricate pathway exhibits evolutionary variation among diverse plant lineages, necessitating thorough investigation.The predicted number of genes for encoding TyrA proteins was 10, 14, and 4 for H. sabdariffa, H. syriacus, and H. trionum, respectively.Employing a maximum likelihood phylogenetic analysis, the protein sequences of the candidate genes were compared with TyrA proteins from other organisms (Fig. 5; Supplementary Data 3).HsADH1-3, HsADH9, HsyADH3, and HtADH1 formed a clade with AtADH1.Similarly, HsADH4, HsADH6, HsADH7, HsyADH4-6, and HtADH2-4 clustered with AtADH2 (Fig. 5).Given the in vitro biochemical experimental validation of AtADH1 and AtADH2 (Rippert andMatringe 2002b, 2002a), these enzymes emerge as primary targets for subsequent validation as representatives of ADH activity.
MtPDH1, GmPDH1, and GmPDH2 formed a distinct clade alongside other ADHs; these enzymes exhibit PDH activity in addition to the conventional ADH activity, which has only been detected in some legumes (Schenck et al. 2015).HsyADH1, HsyADH2, and HsADH5 also formed unique clades with other TyrA enzymes.This indicates the need for biochemical validation of the catalytic potential of HsyADH1, HsyADH2, and HsADH5 in the production of 4-hydroxyphenylpyruvic acid from prephenate.Validating catalytic functions of these enzymes would provide crucial insights.We did not identify a gene associated with legume ADH in H. trionum.It is worth noting that many plants lack legume class ADH genes, which are for preventing negative feedback by phenylalanine.While our data do not provide a complete explanation, we speculate that the regulation of ADH may hold less significance in H. trionum, or alternatively, H. trionum may have evolved distinct mechanisms for upregulating the phenylpropanoid pathway.This phylogenetic approach holds promise for unraveling the biosynthetic pathway of roselle-enriched natural products within H. sabdariffa.

Comparative genomics
A genome-wide synteny analysis was conducted on the genomes of Hibiscus genus organisms, using shared BUSCO genes (Fig. 6a  and b).The resulting Circos plot illustrates the positions of shared BUSCO genes and the physical relationships between each assembled unit in H. sabdariffa and H. trionum (Fig. 6a) as well as in H. sabdariffa and H. syriacus (Fig. 6b).The linkage revealed the physical synteny between the genomes of these species.Notably, Scaffold_00004 of H. sabdariffa Hap 2 genome exhibited synteny with chromosome 3 of the H. trionum genome (Fig. 6a).These findings not only confirm the integrity of our assembled genome but also provide resources for comparative genomic analysis within the Hibiscus genus.
Phylogenomic analysis was carried out on the Malvaceae family and A. thaliana, incorporating the assembled Hap 2 genome of H. sabdariffa.The results demonstrated that A. thaliana was as an outgroup, forming a distinct clade, while the genera Gossypium, Corchorus, and Hibiscus formed as distinct entities in the phylogenetic tree (Fig. 6c).Noteworthy is the consistency of our findings with chloroplastic genomic sequence-based phylogenetic analyses of the Gossypium genus (Wu et al. 2018).This analysis identified a distinct genetic cluster comprising G. anomalum, G. austral, G. stocksii, and other Gossypium species, which supports our results (Fig. 6c).Additionally, H. trionum and H. sabdariffa formed a clade, and C. capsularis and C. olitorius constituted another clade, both distinct from the Gossypium clade (Fig. 6c).The phylogenomic analysis could serve as a foundational resource for comparative genomics within the Malvaceae family.

Conclusion
Roselle is an important medicinal plant producing diverse natural products.However, the lack of genomic resource had limited the biological understanding of roselle and its pharmacological applications.In this study, we report the high-quality genome of H. sabdariffa and putative genes related with its metabolism.Our results could be a foundation for understanding genetic evolution and metabolism in roselle.

Fig. 1 .
Fig. 1. a) The read length distribution of H. sabdariffa genomic DNA sequenced by PacBio HiFi sequencing.b) The cumulative length of contigs of haplotypes in descending order.Dotted lines represent the numbers of contigs in primary assembly and Hap 1. c) The BUSCO scores of primary assembly, Hap 1, and Hap 2 genomes using eudicots_odb10 database.

Fig. 2 .
Fig. 2. a) The composition of repeat components in H. sabdariffa genome.The lengths of repeat components in each genome are described in a bar plot: retroelements, DNA transposons, rolling-circles, unclassified, small RNA, satellites, simple repeats, and low complexity.b) The telomeric repeat distribution in H. sabdariffa genome.Dots above the mean length lines represent the candidate telomeres.c) A Ks plot of H. sabdariffa primary genome.Three Ks distribution peaks were identified.

Fig. 3 .
Fig. 3. a) The number of genes related to SVs between Hap 1 and Hap 2. b) The number of SNPs related to genes in primary assembly, Hap 1, and Hap 2 genomes.The effects of SNP were classified with high, moderate, and low, which are defined by SnpEff.

Fig. 6 .
Fig. 6.Comparative genomic analysis of the genome of H. sabdariffa.Genome-wide synteny analysis of the genome of H. sabdariffa compared with a) H. trionum and b) H. syriacus, respectively.Left of a and b, H. sabdariffa; right of a, H. trionum; and right of b, H. syriacus.c) Phylogenomic analysis among the Malvaceae species and A. thaliana.

Table 1 .
The raw sequencing data and assembled genomes of H. sabdariffa.

-read genomic DNA sequencing Type Read count Yield (bp) Mean length (bp) Mean Q score
NA, QV and LAI of the previous draft genome of H. sabdariffa were not estimated.

Table 2 .
Repetitive element annotations of H. sabdariffa primary assembly.